Install and load in the libraries and data we need for this section:
# Set your working directory by clicking on the top menu:
# Session > Set Working Directory > To Source File Location
# Install packages
install.packages("dplyr")
# Load in libraries
library(dplyr)
# If you want to read the help page for a function, type 1 question mark in front of the function name:
?read.csv
# If you do not know which package a function belongs to, type 2 question marks in front of the function name to search the help pages of all installed packages:
??read.csv
# Load in data
raw_data <- read.csv("data/raw_data.csv")
During this workshop, we will use functions available in the dplyr package to subset and summarise data.
dplyr functions can be chained together with the %>% (pipe) operator, which passes the output of one function directly into the next as its first argument. This lets you ‘stack’ several functions without storing or printing each intermediate result. You’ll see this in use in the following examples.
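To make the pipe concrete, here is a minimal sketch comparing a nested call with the equivalent piped call. The `toy` data frame and its values are hypothetical stand-ins, not the workshop's `raw_data`.

```r
library(dplyr)

# Hypothetical toy data (not the workshop's raw_data)
toy <- data.frame(
  Age = c(25, 41, 33),
  Region = c("Mara", "Mara", "Other")
)

# Nested call: has to be read from the inside out
nested <- head(select(toy, Age), 2)

# Piped call: reads from top to bottom, one step per line
piped <- toy %>%
  select(Age) %>%
  head(2)

# Both approaches should give the same table
identical(nested, piped)
```

The piped version does the same work, but each step sits on its own line in the order it is carried out.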
Subsetting is commonly used in R to select data that you would like to use. The select function can be used to subset based on column name.
Compare the results of using the function normally vs. with the pipe operator:
select(raw_data, Age, Region)
raw_data %>%
  select(Age, Region)
The filter function can be used to subset rows based on the values of variables in your data.
What does each of these lines of code filter the data for?
raw_data %>%
  filter(Region == "Mara")
raw_data %>%
  filter(Age >= 30)
These outputs can be saved as an object, exactly as you normally would.
Store your output table as an object:
# Save output as an object
subsetted_data <- raw_data %>%
  filter(Region == "Mara")
# Print table
subsetted_data
Sometimes, you may want to work with summaries of your data. The summarise function can be used to calculate summaries of variables in your data.
What does each of the following lines of code summarise?
raw_data %>%
  summarise(n_males = length(which(Sex=="M")))
raw_data %>%
  summarise(total_age = sum(Age))
As mentioned earlier, dplyr functions can be stacked using the %>% (pipe) operator. For example, the summarise function can be combined with group_by to summarise variables by one or more columns.
How are these two tables different?
raw_data %>%
  group_by(Sex) %>%
  summarise(n_records = length(Sex))
raw_data %>%
  group_by(Region, Sex) %>%
  summarise(total_age = sum(Age))
Fill in the blanks for the following lines in your R script:
# Subset for only records with a dog
raw_data %>%
  ___(species=="dog")
# Subset for humans, and summarise the mean age per region
raw_data %>%
  ___(species=="human") %>%
  group_by(___) %>%
  ___(mean_age = ___(age))
library(ggplot2)
library(lubridate)
library(leaflet)
ggplot() +
  geom_bar(data=raw_data, aes(x=sex), fill=col_palette[1]) +
  theme_classic()
# Set start and end dates for time series
raw_data$date <- as.Date(raw_data$date)
ts_start <- as.Date(paste0(substr(min(raw_data$date),1,7), "-01"))
ts_end <- ceiling_date(max(raw_data$date),'month')
ts_breaks <- seq(ts_start, ts_end, by="month")
# Subset data
male_data <- raw_data %>% filter(sex=="M")
female_data <- raw_data %>% filter(sex=="F")
# Use histogram function to summarise numbers for each month
# right=FALSE keeps a record dated on the 1st of a month in that month, not the previous one
ts_male <- hist(male_data$date, plot=FALSE, breaks=ts_breaks, right=FALSE)
ts_female <- hist(female_data$date, plot=FALSE, breaks=ts_breaks, right=FALSE)
# Create a data frame containing the time series data
# head(ts_breaks, -1) drops the final break, leaving one start date per monthly bin
ts_data <- data.frame(date = rep(head(ts_breaks, -1), 2),
                      sex = c(rep("Male", length(ts_male$counts)),
                              rep("Female", length(ts_female$counts))),
                      n = c(ts_male$counts, ts_female$counts))
# Plot
ggplot() +
  geom_col(data=ts_data, aes(x=date, y=n, fill=sex)) +
  labs(x="Date", y="Number") +
  scale_fill_manual(name="Gender", values=col_palette[1:2]) +
  theme_classic()
# Create a new column specifying domestic vs. wildlife vs. human
# Create the column first so that the indexed assignments below always match the number of rows
raw_data$species_type <- NA_character_
raw_data$species_type[raw_data$species %in% c("dog", "cat")] <- "Domestic"
raw_data$species_type[raw_data$species %in% c("jackal", "lion")] <- "Wildlife"
raw_data$species_type[raw_data$species %in% "human"] <- "Human"
# Use only one year
leaflet_data <- raw_data %>%
  mutate(year = substr(date, 1, 4)) %>%
  filter(year == "2014")
# Setup point colours using the colorFactor() function
leaflet_pal <- colorFactor(palette=col_palette[1:3], domain = unique(leaflet_data$species_type))
# Plot
leaflet() %>%
  addPolygons(data=region_shp, weight=1, color="black", fillColor = "white", fillOpacity=1) %>%
  addCircleMarkers(data=leaflet_data, lng=~x, lat=~y, color=~leaflet_pal(species_type),
                   radius=3, opacity = 1, fillOpacity=1, label=~species)
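
The species classification step above assigns each label with a separate indexed assignment. As a possible alternative, the same mapping can be sketched with dplyr's mutate and case_when, which keeps all the rules in one place. The `toy` data frame below is a hypothetical stand-in for `raw_data`, and the "cow" row is included only to show what happens to a species that matches none of the rules.

```r
library(dplyr)

# Hypothetical toy data standing in for raw_data
toy <- data.frame(
  species = c("dog", "lion", "human", "cat", "jackal", "cow"),
  stringsAsFactors = FALSE
)

# Rules are checked in order; the first one that matches sets the label
toy <- toy %>%
  mutate(species_type = case_when(
    species %in% c("dog", "cat") ~ "Domestic",
    species %in% c("jackal", "lion") ~ "Wildlife",
    species == "human" ~ "Human",
    TRUE ~ NA_character_  # anything unmatched is left as NA
  ))

toy
```

Whether this reads better than the indexed assignments is a matter of taste; it fits most naturally when the rest of the script is already written as a pipe chain.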